Legend: df = dataframe
pd = pandas
df = pd.read_csv("file.csv")
df.describe()
<- Very Useful
df.columns
<- Read Headers (names of each column)
^- Output: Index(['text'], dtype='object')
Quick Previews
df.head(3)
df.tail(2)
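A minimal sketch of the preview methods above; the frame and its column name are invented for illustration:

```python
import pandas as pd

# Toy frame for illustration; the 'HP' column name is made up
df = pd.DataFrame({'HP': [45, 60, 80, 39, 65]})

first3 = df.head(3)  # first 3 rows
last2 = df.tail(2)   # last 2 rows
print(first3)
print(last2)
```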
Headers
df['text']
<- Read the column with the given name
df[['text1', 'text2']]
<- Reading multiple columns (note the double brackets)
Rows
df.iloc[0]
<- Read the first row (iloc uses integer positions, starting at 0)
df.iloc[1, 2]
<- Read row ∩ column (row position 1, column position 2)
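The column and row selections above can be sketched like this; the column names echo the notes but the data is invented:

```python
import pandas as pd

# Hypothetical frame; 'text1'/'text2' come from the notes above
df = pd.DataFrame({
    'text1': ['a', 'b', 'c'],
    'text2': [1, 2, 3],
})

col = df['text1']              # single column -> Series
both = df[['text1', 'text2']]  # list of names (double brackets) -> DataFrame
row = df.iloc[0]               # row at integer position 0 (the first row)
cell = df.iloc[0, 1]           # row 0, column 1 -> a single value
```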
Iterating over each row
for index, row in df.iterrows():
    print(index, row['Name'])
df.loc[df['Type 1'] == "Grass"]
<- Filter rows by a condition
df.sort_values('HP', ascending=False)
df.sort_values(['Type 1', 'HP'])
<- Multiple columns are allowed
Data Frames
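The filtering and sorting notes above might run like this; the Pokemon-style columns ('Type 1', 'HP') follow the notes, but the rows are invented:

```python
import pandas as pd

# Invented rows matching the Pokemon-style columns in the notes
df = pd.DataFrame({
    'Type 1': ['Grass', 'Fire', 'Grass'],
    'HP': [45, 39, 60],
})

grass = df.loc[df['Type 1'] == 'Grass']        # keep only matching rows
by_hp = df.sort_values('HP', ascending=False)  # sort by one column, descending
multi = df.sort_values(['Type 1', 'HP'])       # sort by several columns
```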
(looks like an Excel table)
We can think of Data Frames as a combination of multiple series
index=[]
<- basically the row labels
columns=[]
<- the column labels
import pandas as pd

certificates_earned = pd.DataFrame({
    'Certificates': [8, 2, 5, 6],
    'Time (in months)': [16, 5, 9, 12]
})
names = ['Tom', 'Kris', 'Ahmad', 'Beau']
certificates_earned.index = names
isna(), notna(), and dropna()
Functions: pd.isna(), pd.notna()
Methods: s.isna(), s.notna(), s.dropna()
isna() / notna() flag missing values; dropna() removes them.
The dropna method can remove rows or columns with missing values, and you can specify axis and thresholds (thresh).
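A small sketch of the missing-value tools above, on both a Series and a DataFrame (the data is invented):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

flags = s.isna()   # True where a value is missing
kept = s.dropna()  # missing values removed

# On a DataFrame, axis and thresh control what gets dropped
df = pd.DataFrame({'a': [1.0, np.nan], 'b': [np.nan, np.nan]})
cols = df.dropna(axis=1, thresh=1)  # keep columns with at least 1 non-missing value
```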
pd.isnull(np.nan)
<- Pandas' isnull function (an alias of isna) checks whether a value is null; this returns True
np.nan
- (Not A Number)
Why would there be an "np.nan"?
np.nan
is recognized by various libraries within the scientific Python ecosystem, including Pandas, SciPy, and scikit-learn. This makes it easier to work with missing data across different tools.
Data frames can be analyzed using the info method and the shape attribute to understand structure and missing values.
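A quick sketch of inspecting structure and missing values with shape, info, and isna (invented data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan], 'b': ['x', 'y', 'z']})

print(df.shape)            # (rows, columns) -> (3, 2)
df.info()                  # prints dtypes and non-null counts per column
missing = df.isna().sum()  # count of missing values per column
print(missing)
```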
Syntax: DataFrame.fillna(value) or DataFrame.fillna(method='ffill'/'bfill')
The fillna method can replace missing values with specific values. By default, it returns a new DataFrame with filled values.
Method Values:
ffill
- forward fill: copies the last valid value down into the missing values below it, in the same COLUMN
bfill
- backward fill: copies the next valid value up into the missing values above it, in the same COLUMN
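A minimal sketch of fill behavior on an invented Series. Note that newer pandas (2.1+) deprecates fillna(method=...) in favor of the ffill() / bfill() methods, so the sketch uses those:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])

filled = s.fillna(0)  # replace missing values with a fixed value
fwd = s.ffill()       # forward fill: copy the last valid value downward
bwd = s.bfill()       # backward fill: pull the next valid value upward
```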
Categorical column cleaning involves using unique
or value_counts
to identify invalid values, followed by replacing or fixing them.
For more complex fixes, coding skills might be required, such as when handling ages with typographical errors.
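A sketch of the categorical-cleaning workflow above; the column values and the 'D' -> 'F' correction are invented for illustration:

```python
import pandas as pd

# Hypothetical dirty categorical column
s = pd.Series(['M', 'F', 'M', 'D', 'F'])

print(s.unique())        # which values actually occur
print(s.value_counts())  # how often each one appears

clean = s.replace({'D': 'F'})  # fix the invalid value (assumed correction)
```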
Duplicates are a common concern in data analysis and require defining what constitutes a duplicate value.
The DataFrame.duplicated()
method in pandas helps identify duplicate values based on specified rules.
subset=[]
parameter is used to narrow down which columns are compared
# Check for duplicated rows based on specific columns
duplicates_subset = df.duplicated(subset=['Name', 'Age'])
Returns BOOLEAN Values:
True: the element is the same as a previous element/s
False: the current value isn't a duplicate of a previous element/s
DataFrame.drop_duplicates()
removes duplicate rows from a DataFrame based on certain criteria.
keep parameter:
-> keep='first' keeps the first occurrence (default)
-> keep='last' keeps the last occurrence
-> keep=False drops ALL DUPLICATES
-> can be passed to both drop_duplicates() and duplicated()
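The duplicate-handling notes above can be sketched like this; the rows are invented, and Name + Age is the assumed duplicate rule:

```python
import pandas as pd

# Invented rows; Name + Age define a duplicate here
df = pd.DataFrame({
    'Name': ['Tom', 'Tom', 'Kris'],
    'Age': [30, 30, 25],
    'City': ['NY', 'LA', 'NY'],
})

mask = df.duplicated(subset=['Name', 'Age'])                       # True for repeat rows
keep_first = df.drop_duplicates(subset=['Name', 'Age'])            # default: keep='first'
drop_all = df.drop_duplicates(subset=['Name', 'Age'], keep=False)  # drop ALL duplicates
```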
Created: 2024-03-03